Olympic Data

1 Import the data

  NOC Year Decade     ID First.Name                   Name Last.Name Sex Age
1 AFG 1960  1960s  59346   Mohammad   Mohammad Asif Khokan    Khokan   M  24
2 AFG 1960  1960s  59043       Faiz Faiz Mohammad Khakshar  Khakshar   M  18
3 AFG 1960  1960s 109486      Abdul     Abdul Hadi Shekaib   Shekaib   M  20
  Height Weight      BMI BMI.Category        Team Population       GDP    GDPpC
1    171     78 26.67487            3 Afghanistan    8996973 537777800 59.77319
2    162     52 19.81405            0 Afghanistan    8996973 537777800 59.77319
3    178     68 21.46194            2 Afghanistan    8996973 537777800 59.77319
        Games Season City     Sport                                   Event
1 1960 Summer Summer Roma Wrestling Wrestling Men's Middleweight, Freestyle
2 1960 Summer Summer Roma Wrestling    Wrestling Men's Flyweight, Freestyle
3 1960 Summer Summer Roma Athletics              Athletics Men's 100 metres
     Medal Medal.No.Yes
1 No Medal            0
2 No Medal            0
3 No Medal            0
 [ reached 'max' / getOption("max.print") -- omitted 3 rows ]
'data.frame':   151977 obs. of  24 variables:
 $ NOC         : Factor w/ 122 levels "AFG","ALB","AND",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Year        : int  1960 1960 1960 1960 1960 1960 1960 1960 1960 1960 ...
 $ Decade      : Factor w/ 6 levels "1960s","1970s",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ ID          : int  59346 59043 109486 59102 128736 29626 39922 106372 128736 58364 ...
 $ First.Name  : Factor w/ 14118 levels "","A","A.","Aadam",..: 8716 3731 64 599 64 11978 64 4634 64 8716 ...
 $ Name        : Factor w/ 74268 levels "  Gabrielle Marie \"Gabby\" Adcock (White-)",..: 48941 19066 218 3341 220 64832 215 23793 220 48946 ...
 $ Last.Name   : Factor w/ 47370 levels "","-)","-Alard)",..: 23228 23112 38893 23137 44908 13260 16633 37860 44908 22890 ...
 $ Sex         : Factor w/ 2 levels "F","M": 2 2 2 2 2 2 2 2 2 2 ...
 $ Age         : int  24 18 20 35 20 28 22 23 20 20 ...
 $ Height      : int  171 162 178 166 179 168 172 170 179 166 ...
 $ Weight      : num  78 52 68 66 75 73 70 58 75 62 ...
 $ BMI         : num  26.7 19.8 21.5 24 23.4 ...
 $ BMI.Category: Factor w/ 5 levels "0","1","2","3",..: 4 1 3 3 3 4 3 3 3 3 ...
 $ Team        : Factor w/ 332 levels "Acipactli","Afghanistan",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ Population  : int  8996973 8996973 8996973 8996973 8996973 8996973 8996973 8996973 8996973 8996973 ...
 $ GDP         : num  5.38e+08 5.38e+08 5.38e+08 5.38e+08 5.38e+08 ...
 $ GDPpC       : num  59.8 59.8 59.8 59.8 59.8 ...
 $ Games       : Factor w/ 30 levels "1960 Summer",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Season      : Factor w/ 2 levels "Summer","Winter": 1 1 1 1 1 1 1 1 1 1 ...
 $ City        : Factor w/ 29 levels "Albertville",..: 19 19 19 19 19 19 19 19 19 19 ...
 $ Sport       : Factor w/ 51 levels "Alpine Skiing",..: 51 51 3 51 3 51 3 3 3 51 ...
 $ Event       : Factor w/ 489 levels "Alpine Skiing Men's Combined",..: 478 468 17 476 33 482 22 24 18 466 ...
 $ Medal       : Factor w/ 4 levels "Bronze","Gold",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ Medal.No.Yes: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...

Done.
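
The import code itself is not shown. A minimal sketch of how output like the above could be produced, assuming a CSV file named `olympics.csv` (a hypothetical name; the actual source file is not given) and falling back to three rows copied from the printed output when that file is absent:

```r
# Hypothetical path: the original does not name the data file.
csv_path <- "olympics.csv"

if (file.exists(csv_path)) {
  # stringsAsFactors = TRUE matches the Factor columns in the str() output.
  olympics <- read.csv(csv_path, stringsAsFactors = TRUE)
} else {
  # Stand-in: a few columns of the first three rows printed above.
  olympics <- read.csv(text = "NOC,Year,Sex,Age,Height,Weight,Sport,Medal
AFG,1960,M,24,171,78,Wrestling,No Medal
AFG,1960,M,18,162,52,Wrestling,No Medal
AFG,1960,M,20,178,68,Athletics,No Medal", stringsAsFactors = TRUE)
}

head(olympics, 3)  # first rows
str(olympics)      # column types and factor levels
```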

2 Time Series

2.1 Number of Events

Let's look at the number of sports per year at the Summer Olympics (source: https://www.topendsports.com/events/summer/sports/number.htm).

There is clearly an upward trend, but no seasonal pattern. The data is also a little choppy at the beginning. Part of the explanation is that the data points are not evenly spaced. Most Olympic games are 4 years apart, but a few of them are just 2 years apart, and during World War I and World War II there were 8-year and 12-year gaps, respectively. Since time series data should be evenly spaced over time, we’ll only look at data from 1948 on, when the Olympics started being held every 4 years without any interruptions.

Let's see if I can build a time series using our data.

Time Series:
Start = 1948 
End = 2020 
Frequency = 1 
     Year Num.Sports
1948 1948         17
1949 1952         17
1950 1956         17
1951 1960         17
1952 1964         19
1953 1968         18
1954 1972         21
1955 1976         21
1956 1980         21
1957 1984         21
1958 1988         23
1959 1992         25
1960 1996         26
1961 2000         28
1962 2004         28
1963 2008         28
1964 2012         26
1965 2016         28
1966 2020         33
1967 1948         17
1968 1952         17
1969 1956         17
1970 1960         17
1971 1964         19
1972 1968         18
1973 1972         21
1974 1976         21
1975 1980         21
1976 1984         21
1977 1988         23
1978 1992         25
1979 1996         26
1980 2000         28
1981 2004         28
1982 2008         28
1983 2012         26
1984 2016         28
 [ reached getOption("max.print") -- omitted 36 rows ]
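
In the printout above the time index advances one year per row (Frequency = 1), so the 1952 games are labelled 1949, and the values wrap around once the 19 games have been listed. One possible way to keep the index aligned with the actual games is to pass only the counts and set the spacing to four years. This is a sketch using the counts printed above; the names `num_sports` and `sports_ts` are illustrative, not taken from the original code:

```r
# Number of sports at each games from 1948 to 2020, copied from the output above.
num_sports <- c(17, 17, 17, 17, 19, 18, 21, 21, 21, 21,
                23, 25, 26, 28, 28, 28, 26, 28, 33)

# deltat = 4 places consecutive observations four years apart.
sports_ts <- ts(num_sports, start = 1948, deltat = 4)

time(sports_ts)  # time index of each observation
```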

2.2 Creating the models

I’m going to try 4 different models.

\[ \begin{aligned} y_{\text{linear}}(x) &= ax + b \\ y_{\text{quadratic}}(x) &= ax^2 + bx + c \\ y_{\text{exponential}}(x) &= a\exp(bx) + c \\ y_{\text{cubic}}(x) &= ax^3 + bx^2 + cx + d \end{aligned} \]

And I’ll be able to use ANOVA to test the nested models: linear vs quadratic, and exponential growth vs s-curve (sigmoid).
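
As an illustration of the nested comparison, here is a sketch of the linear vs. quadratic test using `lm()` and `anova()` on the 1948 to 2020 counts. The data frame `sports` and the model names are assumptions, and the exponential fit is left out because it would need `nls()` with starting values that the original does not give:

```r
# Counts per games, copied from the time series output.
sports <- data.frame(
  Year = seq(1948, 2020, by = 4),
  Num.Sports = c(17, 17, 17, 17, 19, 18, 21, 21, 21, 21,
                 23, 25, 26, 28, 28, 28, 26, 28, 33)
)

fit_linear    <- lm(Num.Sports ~ Year, data = sports)
# poly() uses orthogonal polynomials, which avoids numerical trouble
# from squaring years near 2000; the fitted curve is still a quadratic.
fit_quadratic <- lm(Num.Sports ~ poly(Year, 2), data = sports)

# F-test: does the quadratic term reduce the residual sum of squares enough?
cmp <- anova(fit_linear, fit_quadratic)
cmp
```

A large p-value in the second row means the extra term is not justified by the data.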

These models all look fairly similar. Let's check using ANOVA.

Res.Df  Res.Sum Sq  Df    Sum Sq   F value     Pr(>F)
    17    33.33158  NA        NA        NA         NA
    16    30.41244   1  2.919136  1.535759  0.2331213
    16    32.28073   0  0.000000        NA         NA
    15    29.04971   1  3.231026  1.668361  0.2160269

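Neither F-test is significant (p = 0.23 for linear vs. quadratic and p = 0.22 for the second comparison), so the linear model is preferred: nothing is gained from adding complexity to the model.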

Let's look at the top 10 sports by number of participants.

                  Sport  freq
3             Athletics 19641
41             Swimming 14094
22           Gymnastics 13175
12 Cross Country Skiing  6134
1         Alpine Skiing  5649
14              Cycling  5567
31               Rowing  5325
34             Shooting  5307
17              Fencing  5073
11             Canoeing  4198
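
A ranking like this can be obtained by tabulating the `Sport` column and sorting by frequency. The following base-R sketch runs on a toy data frame (`toy` is a made-up stand-in for the full data set, and the original may well have used a different counting function):

```r
# Toy stand-in for the full data: one row per athlete entry.
toy <- data.frame(
  Sport = c("Athletics", "Athletics", "Athletics",
            "Swimming", "Swimming", "Rowing")
)

# Count rows per sport, name the count column "freq", sort descending.
sport_counts <- as.data.frame(table(Sport = toy$Sport), responseName = "freq")
sport_counts <- sport_counts[order(-sport_counts$freq), ]

top_sports <- head(sport_counts, 10)  # at most ten rows
top_sports
```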

I need to subset the data because I keep getting the following error: “Error: vector memory exhausted (limit reached?)”. I will drop the following variables: NOC, Decade, ID, First.Name, Name, BMI, BMI.Category, Games, City, Event. I will only focus on the top ten sports.

[1] 151977
[1] 58693
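
The two numbers above are the row counts before and after subsetting. A sketch of that step on toy data, dropping the listed columns and keeping only selected sports (`toy`, `keep_sports` and `small` are illustrative names, and `keep_sports` stands in for the real top-ten list):

```r
toy <- data.frame(
  NOC    = c("AFG", "USA", "USA", "FRA"),
  Year   = c(1960, 1960, 1964, 1964),
  Sport  = c("Athletics", "Swimming", "Curling", "Athletics"),
  Weight = c(68, 75, 80, 70),
  stringsAsFactors = TRUE
)

# Columns named in the text as ones to drop.
drop_cols <- c("NOC", "Decade", "ID", "First.Name", "Name",
               "BMI", "BMI.Category", "Games", "City", "Event")
keep_sports <- c("Athletics", "Swimming")  # stand-in for the top ten

small <- toy[toy$Sport %in% keep_sports,
             !(names(toy) %in% drop_cols), drop = FALSE]
small <- droplevels(small)  # discard factor levels that no longer occur

nrow(toy)    # rows before
nrow(small)  # rows after
```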

Let's try making logistic regression models for Weight and Height.

  Year Mean_Weight StdDev_Weight Mean_Height StdDev_Height    Sport Sex
1 1924    64.00000      0.000000    167.0000      0.000000 Swimming   F
2 1956    61.00000      4.780914    169.7333      3.634491 Swimming   F
3 1960    62.73469      5.619073    169.3469      6.839076 Swimming   F
4 1964    63.06000      6.466270    171.3600      4.378799 Swimming   F
5 1968    62.45455      5.361348    170.3636      4.583033 Swimming   F
6 1972    60.23611      5.491333    170.3889      4.949194 Swimming   F
'data.frame':   339 obs. of  7 variables:
 $ Year         : int  1924 1956 1960 1964 1968 1972 1976 1980 1984 1988 ...
 $ Mean_Weight  : num  64 61 62.7 63.1 62.5 ...
 $ StdDev_Weight: num  0 4.78 5.62 6.47 5.36 ...
 $ Mean_Height  : num  167 170 169 171 170 ...
 $ StdDev_Height: num  0 3.63 6.84 4.38 4.58 ...
 $ Sport        : Factor w/ 10 levels "Basketball","Canoeing",..: 9 9 9 9 9 9 9 9 9 9 ...
 $ Sex          : Factor w/ 2 levels "F","M": 1 1 1 1 1 1 1 1 1 1 ...
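
A summary table of this shape (mean and standard deviation of weight and height per year, sport and sex) can be sketched with base-R `aggregate()`. The toy data and the names `means`, `sds` and `summary_df` are assumptions. Note that `sd()` returns `NA` for a group with a single athlete, whereas the table above shows 0 for 1924, so the original code presumably handled that case differently:

```r
toy <- data.frame(
  Year   = c(1960, 1960, 1960, 1964, 1964),
  Sport  = "Swimming",
  Sex    = c("F", "F", "M", "F", "F"),
  Weight = c(60, 64, 75, 61, 65),
  Height = c(168, 172, 180, 169, 171)
)

# Group means; the names given inside cbind() become the column names.
means <- aggregate(cbind(Mean_Weight = Weight, Mean_Height = Height) ~
                     Year + Sport + Sex, data = toy, FUN = mean)
# Group standard deviations (NA when a group has one row).
sds <- aggregate(cbind(StdDev_Weight = Weight, StdDev_Height = Height) ~
                   Year + Sport + Sex, data = toy, FUN = sd)

summary_df <- merge(means, sds, by = c("Year", "Sport", "Sex"))
summary_df
```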

   Medal     mean
  Bronze 25.55859
    Gold 25.28269
No Medal 24.93049
  Silver 25.48383
# A tibble: 6 x 3
# Groups:   Year [3]
   Year Sex   mean.Age
  <int> <fct>    <dbl>
1  1960 F         21.6
2  1960 M         26.0
3  1964 F         21.5
4  1964 M         25.7
5  1968 F         20.5
6  1968 M         25.1

2.3 Swimming

2.3.1 Models

2.3.1.1 Female Athletes

Res.Df  Res.Sum Sq  Df    Sum Sq    F value     Pr(>F)
    13   8.5085546  NA        NA         NA         NA
    12   3.3398817   1  5.168673   18.57074  0.0010150
    12   8.1614545   0  0.000000         NA         NA
    11   0.6470143   1  7.514440  127.75427  0.0000002

Res.Df  Res.Sum Sq  Df    Sum Sq   F value     Pr(>F)
    13   11.697173  NA        NA        NA         NA
    12    5.842805   1  5.854368  12.02375  0.0046521
    12   11.347295   0  0.000000        NA         NA
    11    2.432617   1  8.914677  40.31109  0.0000545

Res.Df  Res.Sum Sq  Df    Sum Sq    F value     Pr(>F)
    13    4.326164  NA        NA         NA         NA
    12    4.163567   1  0.162597  0.4686279  0.5066258
    12    4.408289   0  0.000000         NA         NA
    11    2.732882   1  1.675407  6.7436063  0.0248333

2.3.1.2 Male Athletes

Res.Df  Res.Sum Sq  Df    Sum Sq   F value     Pr(>F)
    13   3.7207136  NA        NA        NA         NA
    12   1.8313700   1  1.889344  12.37987  0.0042351
    12   3.5597205   0  0.000000        NA         NA
    11   0.4074947   1  3.152226  85.09186  0.0000016

Res.Df  Res.Sum Sq  Df      Sum Sq     F value     Pr(>F)
    13   13.008921  NA          NA          NA         NA
    12   12.927824   1   0.0810975   0.0752771  0.7884689
    12   12.958849   0   0.0000000          NA         NA
    11    2.535541   1  10.4233087  45.2197021  0.0000327

Res.Df  Res.Sum Sq  Df    Sum Sq   F value     Pr(>F)
    13   10.265883  NA        NA        NA         NA
    12    4.739344   1  5.526539  13.99317  0.0028181
    12   10.762827   0  0.000000        NA         NA
    11    3.544380   1  7.218447  22.40248  0.0006162


2.4 Athletics

2.5 Gymnastics

2.6 Rowing

2.7 Basketball

2.8 Softball

2.9 Fencing

3 References

Izzy Illari

20 April, 2020